32 research outputs found
Supervised Learning in Spiking Neural Networks with Phase-Change Memory Synapses
Spiking neural networks (SNN) are artificial computational models that have
been inspired by the brain's ability to naturally encode and process
information in the time domain. The added temporal dimension is believed to
render them more computationally efficient than the conventional artificial
neural networks, though their full computational capabilities are yet to be
explored. Recently, computational memory architectures based on non-volatile
memory crossbar arrays have shown great promise to implement parallel
computations in artificial and spiking neural networks. In this work, we
experimentally demonstrate, for the first time, the feasibility of realizing
high-performance event-driven in-situ supervised learning systems using
nanoscale and stochastic phase-change synapses. Our SNN is trained to recognize
audio signals of alphabets encoded using spikes in the time domain and to
generate spike trains at precise time instances to represent the pixel
intensities of their corresponding images. Moreover, with a statistical model
capturing the experimental behavior of the devices, we investigate
architectural and systems-level solutions for improving the training and
inference performance of our computational memory-based system. Combining the
computational potential of supervised SNNs with the parallel compute power of
computational memory, the work paves the way for the next generation of efficient
brain-inspired systems.
A Hardware-Oriented Dynamically Adaptive Disparity Estimation Algorithm and its Real-Time Hardware
The computational complexity of disparity estimation algorithms and the need for large size and bandwidth in the external and internal memory make the real-time processing of disparity estimation challenging, especially for High Resolution (HR) images. This paper proposes a hardware-oriented adaptive window size disparity estimation (AWDE) algorithm and its real-time reconfigurable hardware implementation that targets HR video with high quality disparity results. The proposed algorithm is a hybrid solution involving the Sum of Absolute Differences and the Census cost computation methods to vote and select the most suitable disparity candidates. It utilizes a pixel intensity based refinement step to remove faulty disparity computations. The AWDE algorithm dynamically adapts the window size considering the local texture of the image to increase the disparity estimation quality. The proposed reconfigurable hardware of the AWDE algorithm enables handling 60 frames per second on Virtex-5 FPGA at a 1024×768 XGA video resolution for a 120 pixel disparity range.
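The two cost measures named in the abstract above can be sketched in a few lines. This is a minimal illustration and not the AWDE algorithm itself: the fixed window radius `r`, the winner-takes-all selection, and all function names are assumptions made for the example, and the adaptive window sizing, voting, and refinement steps are omitted.

```python
def sad_cost(left, right, x, y, d, r):
    """Sum of absolute differences over a (2r+1) x (2r+1) window.

    Compares the window centred at (x, y) in the left image with the window
    centred at (x - d, y) in the right image.
    """
    return sum(
        abs(left[y + dy][x + dx] - right[y + dy][x + dx - d])
        for dy in range(-r, r + 1)
        for dx in range(-r, r + 1)
    )


def census_bits(img, x, y, r):
    """Census transform: one bit per neighbour, 1 if darker than the centre."""
    centre = img[y][x]
    return [
        int(img[y + dy][x + dx] < centre)
        for dy in range(-r, r + 1)
        for dx in range(-r, r + 1)
        if (dy, dx) != (0, 0)
    ]


def census_cost(left, right, x, y, d, r):
    """Hamming distance between the census signatures of the two windows."""
    a = census_bits(left, x, y, r)
    b = census_bits(right, x - d, y, r)
    return sum(p != q for p, q in zip(a, b))


def best_disparity(left, right, x, y, max_d, r, cost):
    """Winner-takes-all: the disparity in 0..max_d with the lowest cost."""
    return min(range(max_d + 1), key=lambda d: cost(left, right, x, y, d, r))
```

Images are plain nested lists of intensities here. The caller is assumed to keep both windows inside the image bounds, that is, `x - max_d - r >= 0`.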
ESSOP: Efficient and Scalable Stochastic Outer Product Architecture for Deep Learning
Deep neural networks (DNNs) have surpassed human-level accuracy in a variety
of cognitive tasks but at the cost of significant memory/time requirements in
DNN training. This limits their deployment in energy and memory limited
applications that require real-time learning. Matrix-vector multiplications
(MVM) and vector-vector outer product (VVOP) are the two most expensive
operations associated with the training of DNNs. Strategies to improve the
efficiency of MVM computation in hardware have been demonstrated with minimal
impact on training accuracy. However, the VVOP computation remains a relatively
less explored bottleneck even with the aforementioned strategies. Stochastic
computing (SC) has been proposed to improve the efficiency of VVOP computation
but on relatively shallow networks with bounded activation functions and
floating-point (FP) scaling of activation gradients. In this paper, we propose
ESSOP, an efficient and scalable stochastic outer product architecture based on
the SC paradigm. We introduce efficient techniques to generalize SC for weight
update computation in DNNs with unbounded activation functions (e.g.,
ReLU), required by many state-of-the-art networks. Our architecture reduces the
computational cost by re-using random numbers and replacing certain FP
multiplication operations by bit shift scaling. We show that the ResNet-32
network with 33 convolution layers and a fully-connected layer can be trained
with ESSOP on the CIFAR-10 dataset to achieve baseline comparable accuracy.
Hardware design of ESSOP at 14nm technology node shows that, compared to a
highly pipelined FP16 multiplier design, ESSOP is 82.2% and 93.7% better in
energy and area efficiency, respectively, for outer product computation.
Comment: 5 pages, 5 figures. Accepted at ISCAS 2020 for publication.
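The stochastic-computing idea behind the outer product can be illustrated with a short sketch. It is not the ESSOP architecture: it assumes both input vectors are already scaled into [0, 1], encodes each value as a random bit stream, and multiplies with a bitwise AND. Sharing one random sequence across all entries of a vector is used here as one plausible reading of "re-using random numbers"; the function name and parameters are hypothetical.

```python
import random


def sc_outer_product(x, d, n_bits=1024, seed=0):
    """Estimate the outer product of x and d (entries in [0, 1]).

    Each value v becomes a bit stream whose bits are 1 with probability v.
    ANDing two independent streams gives a stream whose mean approximates
    the product of the two values.
    """
    rng = random.Random(seed)
    # Assumption: one random sequence is shared by every entry of x and a
    # second, independent one by every entry of d.
    rx = [rng.random() for _ in range(n_bits)]
    rd = [rng.random() for _ in range(n_bits)]
    x_bits = [[xi > r for r in rx] for xi in x]
    d_bits = [[dj > r for r in rd] for dj in d]
    return [
        [sum(a and b for a, b in zip(xb, db)) / n_bits for db in d_bits]
        for xb in x_bits
    ]
```

The estimate is noisy, and its error shrinks roughly with the square root of `n_bits`. Signs and dynamic range would have to be handled separately, for example by the kind of scaling the abstract mentions.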
Accurate deep neural network inference using computational phase-change memory
In-memory computing is a promising non-von Neumann approach for making
energy-efficient deep learning inference hardware. Crossbar arrays of resistive
memory devices can be used to encode the network weights and perform efficient
analog matrix-vector multiplications without intermediate movements of data.
However, due to device variability and noise, the network needs to be trained
in a specific way so that transferring the digitally trained weights to the
analog resistive memory devices will not result in significant loss of
accuracy. Here, we introduce a methodology to train ResNet-type convolutional
neural networks that results in no appreciable accuracy loss when transferring
weights to in-memory computing hardware based on phase-change memory (PCM). We
also propose a compensation technique that exploits the batch normalization
parameters to improve the accuracy retention over time. We achieve a
classification accuracy of 93.7% on the CIFAR-10 dataset and a top-1 accuracy
on the ImageNet benchmark of 71.6% after mapping the trained weights to PCM.
Our hardware results on CIFAR-10 with ResNet-32 demonstrate an accuracy above
93.5% retained over a one day period, where each of the 361,722 synaptic
weights of the network is programmed on just two PCM devices organized in a
differential configuration.
Comment: This is a pre-print of an article accepted for publication in Nature
Communications.
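A differential configuration stores each signed weight as the difference between two non-negative conductances. The sketch below shows an idealized, noise-free version of that mapping and the resulting matrix-vector product. The linear weight-to-conductance mapping, the parameters `w_max` and `g_max`, and the function names are assumptions for illustration; the device variability, noise, and drift that the paper addresses are not modelled.

```python
def to_differential(w, w_max, g_max):
    """Map a signed weight to a (g_plus, g_minus) conductance pair."""
    g = abs(w) / w_max * g_max
    return (g, 0.0) if w >= 0 else (0.0, g)


def crossbar_mvm(weights, x, w_max, g_max):
    """Ideal matrix-vector product using differential conductance pairs.

    Each output is the difference of two summed currents, rescaled from
    conductance units back to weight units.
    """
    scale = w_max / g_max
    out = []
    for row in weights:
        i_plus = 0.0
        i_minus = 0.0
        for w, v in zip(row, x):
            g_plus, g_minus = to_differential(w, w_max, g_max)
            i_plus += g_plus * v
            i_minus += g_minus * v
        out.append((i_plus - i_minus) * scale)
    return out
```

With ideal devices the result equals the ordinary product of `weights` and `x`. In hardware, each conductance would deviate from its target value.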
Scale up your In-Memory Accelerator: leveraging wireless-on-chip communication for AIMC-based CNN inference
Analog In-Memory Computing (AIMC) is emerging as a disruptive paradigm for heterogeneous computing, potentially delivering orders of magnitude better peak performance and efficiency over traditional digital signal processing architectures on Matrix-Vector multiplication. However, to sustain this throughput in real-world applications, AIMC tiles must be supplied with data at very high bandwidth and low latency; this poses an unprecedented pressure on the on-chip communication infrastructure, which becomes the system's performance and efficiency bottleneck. In this context, the performance and plasticity of emerging on-chip wireless communication paradigms provide the required breakthrough to up-scale on-chip communication in large AIMC devices. This work presents a many-tile AIMC architecture with inter-tile wireless communication that integrates multiple heterogeneous computing clusters, embedding a mix of parallel RISC-V cores and AIMC tiles. We perform an extensive design space exploration of the proposed architecture and discuss the benefits of exploiting emerging on-chip communication technologies such as wireless transceivers in the millimeter-wave and terahertz bands.
This work was supported by the WiPLASH project (g.a. 863337), funded by the European Union's Horizon 2020 research and innovation program.
Peer reviewed. Postprint (author's final draft).
Mixed-precision deep learning based on computational memory
Deep neural networks (DNNs) have revolutionized the field of artificial
intelligence and have achieved unprecedented success in cognitive tasks such as
image and speech recognition. Training of large DNNs, however, is
computationally intensive and this has motivated the search for novel computing
architectures targeting this application. A computational memory unit with
nanoscale resistive memory devices organized in crossbar arrays could store the
synaptic weights in their conductance states and perform the expensive weighted
summations in place in a non-von Neumann manner. However, updating the
conductance states in a reliable manner during the weight update process is a
fundamental challenge that limits the training accuracy of such an
implementation. Here, we propose a mixed-precision architecture that combines a
computational memory unit performing the weighted summations and imprecise
conductance updates with a digital processing unit that accumulates the weight
updates in high precision. A combined hardware/software training experiment of
a multilayer perceptron based on the proposed architecture using a phase-change
memory (PCM) array achieves 97.73% test accuracy on the task of classifying
handwritten digits (based on the MNIST dataset), within 0.6% of the software
baseline. The architecture is further evaluated using accurate behavioral
models of PCM on a wide class of networks, namely convolutional neural
networks, long-short-term-memory networks, and generative-adversarial networks.
Accuracies comparable to those of floating-point implementations are achieved
without being constrained by the non-idealities associated with the PCM
devices. A system-level study demonstrates a 173x improvement in energy
efficiency of the architecture when used for training a multilayer perceptron
compared with a dedicated fully digital 32-bit implementation.
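The split between high-precision accumulation and imprecise device updates can be sketched as follows. The transfer rule shown here, issuing whole programming pulses once the accumulated update reaches a device granularity `epsilon`, is an assumption made for illustration, as are all names; the abstract states only that a digital unit accumulates the weight updates in high precision.

```python
def mixed_precision_step(chi, delta_w, epsilon):
    """Accumulate one weight update and decide how many pulses to issue.

    chi      -- high-precision accumulator kept in the digital unit
    delta_w  -- weight update computed for this training step
    epsilon  -- assumed conductance change produced by one device pulse

    Returns the new accumulator value and a signed pulse count.
    """
    chi += delta_w
    pulses = int(chi / epsilon)  # int() truncates toward zero
    chi -= pulses * epsilon      # keep the remainder for later steps
    return chi, pulses
```

Small updates that would be lost on a coarse device accumulate in `chi` until they add up to at least one pulse.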
Benchmarking energy consumption and latency for neuromorphic computing in condensed matter and particle physics
The massive use of artificial neural networks (ANNs), increasingly popular in
many areas of scientific computing, rapidly increases the energy consumption of
modern high-performance computing systems. An appealing and possibly more
sustainable alternative is provided by novel neuromorphic paradigms, which
directly implement ANNs in hardware. However, little is known about the actual
benefits of running ANNs on neuromorphic hardware for use cases in scientific
computing. Here we present a methodology for measuring the energy cost and
compute time for inference tasks with ANNs on conventional hardware. In
addition, we have designed an architecture for these tasks and estimate the
same metrics based on a state-of-the-art analog in-memory computing (AIMC)
platform, one of the key paradigms in neuromorphic computing. Both
methodologies are compared for a use case in quantum many-body physics in two
dimensional condensed matter systems and for anomaly detection at 40 MHz rates
at the Large Hadron Collider in particle physics. We find that AIMC can achieve
up to one order of magnitude shorter computation times than conventional
hardware, at an energy cost that is up to three orders of magnitude smaller.
This suggests great potential for faster and more sustainable scientific
computing with neuromorphic hardware.
Comment: 7 pages, 4 figures, submitted to APL Machine Learning.
Neuromorphic computing using non-volatile memory
Dense crossbar arrays of non-volatile memory (NVM) devices represent one possible path for implementing massively-parallel and highly energy-efficient neuromorphic computing systems. We first review recent advances in the application of NVM devices to three computing paradigms: spiking neural networks (SNNs), deep neural networks (DNNs), and “Memcomputing”. In SNNs, NVM synaptic connections are updated by a local learning rule such as spike-timing-dependent-plasticity, a computational approach directly inspired by biology. For DNNs, NVM arrays can represent matrices of synaptic weights, implementing the matrix-vector multiplication needed for algorithms such as backpropagation in an analog yet massively-parallel fashion. This approach could provide significant improvements in power and speed compared to GPU-based DNN training, for applications of commercial significance. We then survey recent research in which different types of NVM devices (including phase change memory, conductive-bridging RAM, filamentary and non-filamentary RRAM, and other NVMs) have been proposed, either as a synapse or as a neuron, for use within a neuromorphic computing application. The relevant virtues and limitations of these devices are assessed, in terms of properties such as conductance dynamic range, (non)linearity and (a)symmetry of conductance response, retention, endurance, required switching power, and device variability.